Support Vector Machines Project

For this project we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back. We will try to create a model that helps predict this.

Lending Club had a very interesting year in 2016, so let's check out some of their data and keep the context in mind. This data is from before they even went public.

We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the csv already provided. It's recommended you use the csv provided as it has been cleaned of NA values.

Here is what the columns represent:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
  • purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower.
  • days.with.cr.line: The number of days the borrower has had a credit line.
  • revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
  • not.fully.paid: 1 if the borrower did not pay the loan back in full, and 0 otherwise. This is the column we will try to predict.

Data

Open the loan_data.csv file and save it as a dataframe called loans.

In [52]:
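
A minimal sketch of this step. So that it runs on its own, it first writes a three-row stand-in file (made-up rows with only a few of the project's columns) to a temporary path; in the project you would pass 'loan_data.csv' to read.csv() instead. Since R 4.0.0, read.csv() no longer converts strings to factors by default, so stringsAsFactors = TRUE is assumed here to get purpose as a Factor, as it appears in the str() output in this notebook.

```r
# Stand-in file with made-up rows; replace `tmp` with 'loan_data.csv' in the project
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "credit.policy,purpose,int.rate,fico,not.fully.paid",
  "1,debt_consolidation,0.1189,737,0",
  "1,credit_card,0.1071,707,0",
  "0,all_other,0.1357,682,1"
), tmp)

# Read the csv into a dataframe called loans
loans <- read.csv(tmp, stringsAsFactors = TRUE)

# purpose is a factor only because stringsAsFactors = TRUE was passed
class(loans$purpose)
```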

Check the summary and structure of loans.

In [53]:
'data.frame':	9578 obs. of  14 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
 $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
 $ installment      : num  829 228 367 162 103 ...
 $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
 $ dti              : num  19.5 14.3 11.6 8.1 15 ...
 $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
 $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
 $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
 $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
 $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
 $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
 $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...
In [54]:
Out[54]:
 credit.policy                 purpose        int.rate       installment    
 Min.   :0.000   all_other         :2331   Min.   :0.0600   Min.   : 15.67  
 1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039   1st Qu.:163.77  
 Median :1.000   debt_consolidation:3957   Median :0.1221   Median :268.95  
 Mean   :0.805   educational       : 343   Mean   :0.1226   Mean   :319.09  
 3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407   3rd Qu.:432.76  
 Max.   :1.000   major_purchase    : 437   Max.   :0.2164   Max.   :940.14  
                 small_business    : 619                                    
 log.annual.inc        dti              fico       days.with.cr.line
 Min.   : 7.548   Min.   : 0.000   Min.   :612.0   Min.   :  179    
 1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0   1st Qu.: 2820    
 Median :10.929   Median :12.665   Median :707.0   Median : 4140    
 Mean   :10.932   Mean   :12.607   Mean   :710.8   Mean   : 4561    
 3rd Qu.:11.291   3rd Qu.:17.950   3rd Qu.:737.0   3rd Qu.: 5730    
 Max.   :14.528   Max.   :29.960   Max.   :827.0   Max.   :17640    
                                                                    
   revol.bal         revol.util    inq.last.6mths    delinq.2yrs     
 Min.   :      0   Min.   :  0.0   Min.   : 0.000   Min.   : 0.0000  
 1st Qu.:   3187   1st Qu.: 22.6   1st Qu.: 0.000   1st Qu.: 0.0000  
 Median :   8596   Median : 46.3   Median : 1.000   Median : 0.0000  
 Mean   :  16914   Mean   : 46.8   Mean   : 1.577   Mean   : 0.1637  
 3rd Qu.:  18250   3rd Qu.: 70.9   3rd Qu.: 2.000   3rd Qu.: 0.0000  
 Max.   :1207359   Max.   :119.0   Max.   :33.000   Max.   :13.0000  
                                                                     
    pub.rec        not.fully.paid  
 Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000  
 Mean   :0.06212   Mean   :0.1601  
 3rd Qu.:0.00000   3rd Qu.:0.0000  
 Max.   :5.00000   Max.   :1.0000  
                                   

Convert the following columns to categorical data using factor()

  • inq.last.6mths
  • delinq.2yrs
  • pub.rec
  • not.fully.paid
  • credit.policy
In [55]:
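
One way to do the conversion in a single pass, sketched on a small made-up data frame (loans_demo is a hypothetical stand-in; in the project you would apply the same two lines to loans). The name cat.cols is also just an illustrative choice.

```r
# Made-up stand-in rows; use `loans` in the project
loans_demo <- data.frame(
  inq.last.6mths = c(0L, 1L, 3L),
  delinq.2yrs    = c(0L, 0L, 1L),
  pub.rec        = c(0L, 1L, 0L),
  not.fully.paid = c(0L, 0L, 1L),
  credit.policy  = c(1L, 1L, 0L),
  fico           = c(737L, 707L, 682L)
)

# Columns to treat as categorical
cat.cols <- c("inq.last.6mths", "delinq.2yrs", "pub.rec",
              "not.fully.paid", "credit.policy")

# Apply factor() to each listed column; other columns are left untouched
loans_demo[cat.cols] <- lapply(loans_demo[cat.cols], factor)

str(loans_demo)
```

Calling factor() on each column individually, for example loans$pub.rec <- factor(loans$pub.rec), gives the same result and may be easier to read for a handful of columns.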

EDA

Let's use ggplot2 to visualize the data!

Create a histogram of fico scores colored by not.fully.paid

In [56]:
In [57]:
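
A possible shape for the histogram, sketched on made-up data (loans_demo is a hypothetical stand-in for loans). The bin count, transparency, and theme are arbitrary choices, not something the project prescribes.

```r
library(ggplot2)

# Made-up stand-in data; use `loans` in the project
loans_demo <- data.frame(
  fico = rep(seq(620, 820, by = 10), times = 2),
  not.fully.paid = factor(rep(c(0, 1), each = 21))
)

# Histogram of FICO scores, with bars filled by repayment status
pl <- ggplot(loans_demo, aes(x = fico)) +
  geom_histogram(aes(fill = not.fully.paid), color = "black",
                 bins = 10, alpha = 0.5) +
  theme_bw()
print(pl)
```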

Create a barplot of purpose counts, colored by not.fully.paid. Pass position = 'dodge' as an argument to geom_bar() so the bars for each class sit side by side.

In [58]:
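
A sketch of the dodged barplot on made-up data (loans_demo is a hypothetical stand-in for loans). Rotating the axis labels is an optional extra that helps when all seven purpose levels are present.

```r
library(ggplot2)

# Made-up stand-in data; use `loans` in the project
loans_demo <- data.frame(
  purpose = rep(c("credit_card", "debt_consolidation", "educational"), each = 4),
  not.fully.paid = factor(rep(c(0, 0, 0, 1), times = 3))
)

# One bar per (purpose, not.fully.paid) pair, placed side by side
pl <- ggplot(loans_demo, aes(x = purpose)) +
  geom_bar(aes(fill = not.fully.paid), position = "dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(pl)
```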

Create a scatterplot of fico score versus int.rate. Does the trend make sense? Play around with the color scheme if you want.

In [59]:
In [60]:
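
A sketch of the scatterplot on made-up data (loans_demo is a hypothetical stand-in for loans, with an invented downward relationship built in). On the real data you would expect a similar direction, since the column description says riskier borrowers are assigned higher interest rates.

```r
library(ggplot2)

# Made-up stand-in data with an invented negative trend; use `loans` in the project
fico <- seq(620, 820, by = 5)
loans_demo <- data.frame(
  fico = fico,
  int.rate = 0.25 - 0.0002 * fico,
  not.fully.paid = factor(rep(c(0, 1), length.out = length(fico)))
)

# FICO score (y) versus interest rate (x), colored by repayment status
pl <- ggplot(loans_demo, aes(x = int.rate, y = fico)) +
  geom_point(aes(color = not.fully.paid), alpha = 0.5) +
  theme_bw()
print(pl)
```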

Building the Model

Now it's time to build a model!

Train and Test Sets

Split your data into training and test sets using the caTools library.

In [61]:
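
A sketch of the split using sample.split() from caTools, on made-up data (loans_demo is a hypothetical stand-in for loans). The 0.7 ratio and the seed are assumptions: the project does not state a ratio, although the test-set tables later in this notebook total 2873 of 9578 rows, which is consistent with roughly a 70/30 split.

```r
library(caTools)

set.seed(101)  # arbitrary seed, only for reproducibility

# Made-up stand-in data; use `loans` in the project
loans_demo <- data.frame(
  fico = round(rnorm(100, mean = 710, sd = 38)),
  not.fully.paid = factor(rep(c(0, 1), times = c(80, 20)))
)

# Logical vector: TRUE marks rows assigned to the training set
spl <- sample.split(loans_demo$not.fully.paid, SplitRatio = 0.7)

train <- loans_demo[spl, ]
test  <- loans_demo[!spl, ]

c(train = nrow(train), test = nrow(test))
```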

Load the e1071 library as shown in the lecture.

In [62]:

Now use the svm() function to train a model on your training set.

In [97]:
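
A sketch of the training call on made-up data (train_demo is a hypothetical stand-in for your training set). With a factor response, svm() fits a C-classification model with a radial kernel and cost 1 by default, and the default gamma is 1 divided by the number of feature columns after dummy coding. That would explain the gamma of 0.01724138 in the summary output in this notebook, which equals 1/58.

```r
library(e1071)

set.seed(101)  # arbitrary seed, only for reproducibility
n <- 200

# Made-up stand-in data; use your own `train` in the project
train_demo <- data.frame(
  fico = rnorm(n, mean = 710, sd = 38),
  int.rate = runif(n, min = 0.06, max = 0.21),
  not.fully.paid = factor(rep(c(0, 1), times = c(160, 40)))
)

# Formula interface: predict not.fully.paid from all other columns
model <- svm(not.fully.paid ~ ., data = train_demo)

# Two numeric features, so the default gamma here is 1/2
model$gamma
```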

Get a summary of the model.

In [98]:
Out[98]:
Call:
svm(formula = not.fully.paid ~ ., data = train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.01724138 

Number of Support Vectors:  2849

 ( 1776 1073 )


Number of Classes:  2 

Levels: 
 0 1


Use predict to predict new values from the test set using your model. Refer to the lecture on how to do this if you don't remember :)
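
A sketch of prediction and the confusion table on made-up data (demo, train_demo, and test_demo are hypothetical stand-ins; the every-fourth-row split is only there to keep the example short). The table puts predictions in rows and actual values in columns, matching the layout of the output in this notebook.

```r
library(e1071)

set.seed(101)  # arbitrary seed, only for reproducibility
n <- 200

# Made-up stand-in data; use your own `train` and `test` in the project
demo <- data.frame(
  fico = rnorm(n, mean = 710, sd = 38),
  int.rate = runif(n, min = 0.06, max = 0.21),
  not.fully.paid = factor(rep(c(0, 1), times = c(160, 40)))
)
is.test <- seq_len(n) %% 4 == 0   # every 4th row goes to the test set
train_demo <- demo[!is.test, ]
test_demo  <- demo[is.test, ]

model <- svm(not.fully.paid ~ ., data = train_demo)

# predict() ignores the response column when given the full test set
predicted.values <- predict(model, test_demo)

# Rows: predicted class, columns: actual class
cm <- table(predicted.values, test_demo$not.fully.paid)
cm
```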

In [99]:
In [100]:
Out[100]:
                
predicted.values    0    1
               0 2413  460
               1    0    0

Tuning the Model

You probably got some not-so-great results, with the model classifying everything into one group! Let's tune our model to try to fix this.

Use the tune() function to test out different cost and gamma values. In the lecture we showed how to do this by using train.x and train.y, but it's usually simpler to just pass a formula. Try checking out help(tune) for more details. This is the end of the project because tuning can take a long time (since it's running a bunch of different models!). Take as long or as little time with this step as you would like.

Quick hint, your tune() should look something like this:

tune.results <- tune(svm,train.x=not.fully.paid~., data=train,kernel='radial',
                  ranges=list(cost=some.vector, gamma=some.other.vector))
In [ ]:
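
A runnable sketch of the hint above on made-up data (train_demo is a hypothetical stand-in for your training set). The cost and gamma grids are arbitrary small examples, not recommended values. By default tune() evaluates each combination with 10-fold cross-validation, so the run time grows with the size of the grid.

```r
library(e1071)

set.seed(101)  # arbitrary seed, only for reproducibility
n <- 120

# Made-up stand-in data; use your own `train` in the project
train_demo <- data.frame(
  fico = rnorm(n, mean = 710, sd = 38),
  int.rate = runif(n, min = 0.06, max = 0.21),
  not.fully.paid = factor(rep(c(0, 1), times = c(90, 30)))
)

# Grid search over every cost/gamma combination (2 x 2 = 4 models per fold)
tune.results <- tune(svm, not.fully.paid ~ ., data = train_demo,
                     kernel = "radial",
                     ranges = list(cost = c(1, 10), gamma = c(0.1, 1)))

summary(tune.results)

# Best combination found, and the model refit with it
tune.results$best.parameters
tuned.model <- tune.results$best.model
```

The tuned model can then be passed to predict() in the same way as the first model, or you can refit svm() yourself with the chosen cost and gamma.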

In [107]:
Out[107]:
                
predicted.values    0    1
               0 2350  425
               1   63   35
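
To put the two tables in this notebook side by side, the counts can be turned into a few summary numbers. The sketch below re-enters the tuned model's table by hand (rows are predictions, columns are actual values). Accuracy comes out at about 0.830, slightly below the 0.840 obtained by predicting 0 for everyone, but the tuned model now catches some of the unpaid loans: recall is about 0.076 and precision about 0.357 for class 1.

```r
# Counts copied from the tuned model's table (filled column by column)
cm <- matrix(c(2350, 63, 425, 35), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Share of all test loans classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting 0, as the untuned model did
baseline <- sum(cm[, "0"]) / sum(cm)

# Of the loans that were actually not fully paid, the share flagged as 1
recall <- cm["1", "1"] / sum(cm[, "1"])

# Of the loans flagged as 1, the share that were actually not fully paid
precision <- cm["1", "1"] / sum(cm["1", ])

round(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision), 3)
```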

Great Job!